Exploring Key Factors Influencing GitHub Repository Popularity¶

Amy Chen, Junhao Liang, Xiaofei Sun, Dennis Truong Group 14 STAT 301 2024W1

In [1]:
# Main developer: Amy
# Contributor: Xiaofei, Junhao

# Load library
suppressPackageStartupMessages({
    library(readr)
    library(rsample)
    library(broom)
    library(latex2exp)
    library(tidymodels)
    library(repr)
    library(gridExtra)
    library(faraway)
    library(mltools)
    library(leaps)
    library(glmnet)
    library(cowplot)
    library(tidyverse)
    library(modelr)
    library(car)
})

Introduction¶

GitHub is the largest platform for hosting and managing source code repositories, heavily influencing how developers collaborate globally. It is an important hub for open-source projects, providing tools that facilitate sharing, contributing, and tracking code changes. One of GitHub’s key features is its ability to showcase a repository’s popularity through the "star" metric. Stars can be a measure of user appreciation, and also reflect the utility and adoption of a repository within the developer community. This metric holds significant value for repository owners and potential stakeholders, as it informs critical decisions such as marketing strategies, prioritizing feature development, or evaluating the viability of a business venture built around popular tools.

Because of the critical decisions that rely on popularity, others have tried to analyze what makes a repository popular. In 2016, Borges et al. explored factors that influence GitHub repository popularity over time. They found a strong correlation between the number of stars and the number of forks a repository received, suggesting that repositories widely forked (copied and reused) are often more popular. However, their findings showed that metrics like the number of commits or contributors had weaker correlations, indicating that mere activity or the size of a development team does not guarantee popularity. Furthermore, Borges et al. observed that repositories tend to gain stars rapidly following new public releases.

In 2018, Borges et al. built on their earlier work by employing multiple linear regression to predict the future rankings of repositories. Their model predicted future top-10 rankings well, with errors ranging from 0 to 2 positions. However, this study also had limitations. The authors did not expand on how their predictive model arrived at its results or which factors played a more significant role in determining popularity. This lack of transparency leaves a knowledge gap about the underlying features driving the model’s decision making, and thus a repository’s popularity.

These studies focused on a narrow set of features and did not explore the broader range of factors that might contribute to repository popularity. To address these gaps, our research aims to investigate what features most strongly influence the popularity of GitHub repositories. Specifically, we aim to use the number of stars as a measure of popularity and conduct a detailed statistical analysis to identify key determinants using a wider feature set.

For this purpose, we utilize a dataset uploaded in 2023 by Canard, containing over 215,000 repositories, each with at least 167 stars. The dataset captures features such as the number of forks, open issues, watchers, the presence of a Wiki, and more. The data was gathered using the GitHub Search API, which allows querying repositories by star count. Because the API limits each query to at most 1,000 results, the author divided the queries into smaller subsets by specifying star-count ranges. This method enabled the collection of a comprehensive dataset spanning a wide range of repository popularity levels.

Dataset Features¶

Variable | Description | Type
Name | The name of the GitHub repository | String
Description | A brief textual description of the repository’s purpose or focus | String
URL | The unique web address linking to the repository on GitHub | String
Created At | The date and time the repository was initially created on GitHub (ISO 8601 format) | DateTime
Updated At | The most recent date and time of modification to the repository (ISO 8601 format) | DateTime
Homepage | URL to the repository's associated homepage or landing page | String
Size | The storage size of the repository in bytes | Integer
Stars | The number of stars, indicating popularity or community interest | Integer
Forks | Count of times the repository has been forked | Integer
Issues | The number of currently open issues in the repository | Integer
Watchers | The number of users monitoring the repository for updates | Integer
Language | The primary programming language used in the repository | String
License | Information on the repository's software license, identified by a license identifier | String
Topics | List of topics or tags associated with the repository | List
Has Issues | Boolean indicating whether the repository has issue tracking enabled | Boolean
Has Projects | Boolean indicating if GitHub Projects is used for task organization | Boolean
Has Downloads | Boolean indicating if downloadable files are offered in the repository | Boolean
Has Wiki | Boolean indicating if the repository has a Wiki for additional documentation | Boolean
Has Pages | Boolean indicating if GitHub Pages is enabled, allowing for a related website | Boolean
Has Discussions | Boolean indicating if GitHub Discussions is enabled for community collaboration | Boolean
Is Fork | Boolean indicating if the repository is a fork of another repository | Boolean
Is Archived | Boolean indicating if the repository is archived (read-only, not actively maintained) | Boolean
Is Template | Boolean indicating if the repository is set up as a template for creating similar projects | Boolean
Default Branch | The name of the repository's default branch | String

Table 1

Consistent with Borges et al. (2016), we expect more forks to be associated with a higher number of stars. We do not expect commits or contributors to play a large role in a repository's popularity, since large projects can have many issues and contributors while having no public visibility to make them popular. A factor we expect to contribute to popularity that Borges et al. did not explore is whether the repository has a Wiki. We think a Wiki is a positive indicator: for repositories popular enough that many people want to use and understand them, the time investment of writing a Wiki becomes more efficient than answering questions individually.

Our study seeks to explore and infer the underlying factors that drive popularity. By analyzing a larger and more recent dataset, we aim to uncover new insights and provide a more comprehensive understanding of how various repository features influence their popularity. Ultimately, our findings will help developers and stakeholders make more informed decisions about their repository’s popularity.

Methods and Results¶

Exploratory Data Analysis (EDA)¶

In [2]:
# Main developer: Junhao
# Contributor:

# read the dataset
repositories <- read.csv("repositories.csv")

head(repositories, n=3)
A data.frame: 3 × 24 (first rows shown: the freeCodeCamp, free-programming-books, and awesome repositories, with columns Name, Description, URL, Created.At, Updated.At, Homepage, Size, Stars, Forks, Issues, …, Default.Branch)

Table 2

Idea: We will manually remove missing values, as well as columns that we do not expect to help the prediction model — either because they contain free text without categorical levels, or because they have too many category levels to handle.

In [3]:
# Main developer: Xiaofei
# Contributor:

summary(repositories %>% select(Size, Forks, Issues, Stars, Watchers))
      Size               Forks              Issues             Stars       
 Min.   :        0   Min.   :     0.0   Min.   :    0.00   Min.   :   167  
 1st Qu.:      378   1st Qu.:    39.0   1st Qu.:    3.00   1st Qu.:   237  
 Median :     2389   Median :    79.0   Median :   10.00   Median :   377  
 Mean   :    54283   Mean   :   234.2   Mean   :   37.92   Mean   :  1115  
 3rd Qu.:    15282   3rd Qu.:   174.0   3rd Qu.:   28.00   3rd Qu.:   797  
 Max.   :105078627   Max.   :243339.0   Max.   :26543.00   Max.   :374074  
    Watchers     
 Min.   :   167  
 1st Qu.:   237  
 Median :   377  
 Mean   :  1115  
 3rd Qu.:   797  
 Max.   :374074  

Table 3

From Table 3, we observe that the maximum value of each numerical variable lies far from the bulk of the data, indicating extreme outliers.

Correlation Heatmap for numerical variables¶

In [4]:
# Main developer: Xiaofei
# Contributor: Amy

library(corrplot)

# filter the numerical variables
numeric_data <- repositories %>% select(Size, Forks, Issues, Stars, Watchers)

# Calculate the correlation matrix
correlation_matrix <- cor(numeric_data)

# Create a heatmap of the correlation matrix
corrplot(correlation_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, 
         title = "Correlation Heatmap with Stars", tl.cex = 0.8, mar = c(0, 0, 2, 0),
         addCoef.col = "black", number.cex = 0.7)


Figure 1

In Figure 1, we focus on the correlation between Stars and other numerical variables. Notably, Forks and Watchers show strong correlations with our target variable, with Watchers being especially prominent. We find that Watchers is identical to the Stars variable for every sample in the dataset. To avoid redundancy and ensure the model's validity, we decide to remove the Watchers variable from the analysis, as it is essentially a duplicate of our response variable.
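The claimed equivalence can be verified directly (a minimal sketch, assuming the `repositories` data frame loaded above):

```r
# Sanity check: is Watchers an exact duplicate of Stars for every row?
all(repositories$Watchers == repositories$Stars)
```

For this dataset the check returns TRUE, confirming that Watchers carries no information beyond the response itself.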

Then we check the proportion of TRUE in each categorical variable:

In [5]:
# Main developer: Xiaofei
# Contributor:

binary_columns <- c("Has.Issues", 'Has.Projects', 'Has.Downloads', 'Has.Wiki', 'Has.Pages', 'Has.Discussions', 'Is.Fork', 'Is.Archived', 'Is.Template')  

result <- repositories %>%
  select(all_of(binary_columns)) %>%
  summarize(across(everything(), ~ mean(. == "True", na.rm = TRUE)))

print(result)
  Has.Issues Has.Projects Has.Downloads  Has.Wiki Has.Pages Has.Discussions
1  0.9679206    0.8660925     0.9908245 0.7935069 0.1772738       0.1244111
  Is.Fork Is.Archived Is.Template
1       0  0.06580043 0.006389836

Here we find that Is.Fork has no TRUE values (every repository in the dataset is a non-fork), so we will remove it.

Faceted Bar Plot for categorical variables¶

In [6]:
# Main developer: Xiaofei
# Contributor:

repositories %>%
  select(Has.Issues, Has.Projects, Has.Downloads, Has.Wiki, Has.Pages, Has.Discussions, Is.Archived, Is.Template, Stars) %>%
  gather(key = "Variable", value = "Value", -Stars) %>%
  group_by(Variable, Value) %>%
  summarize(mean_stars = mean(Stars, na.rm = TRUE), .groups = "drop") %>%  
  ggplot(aes(x = Value, y = mean_stars, fill = Value)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ Variable) + 
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Average Stars by Different Variables", x = "Category", y = "Average Stars") +
  theme_minimal()
No description has been provided for this image

Figure 2

From Figure 2, we observe what appears to be a substantial difference in average star counts between TRUE and FALSE for the variables Has.Projects, Has.Downloads, Has.Wiki, Has.Pages, and Has.Discussions.

Data Wrangling¶

We convert the categorical variables to numeric (0/1) for now so that they can be used with the regsubsets() function in the forward selection step below.

In [7]:
# Main developer: Junhao
# Contributor: Amy, Xiaofei

# Remove rows with missing values
repo_tidy <- repositories %>%
    drop_na()

# Filter relevant columns and convert categorical columns to binary (1/0)
data_repo_clean <- repo_tidy %>% 
    select(Size, Stars, Forks, Issues, Has.Issues, Has.Projects, Has.Downloads, Has.Wiki, Has.Pages, Has.Discussions, Is.Archived, Is.Template) %>%
    mutate(across(c(Has.Issues, Has.Projects, Has.Downloads, Has.Wiki, Has.Pages, Has.Discussions, Is.Archived, Is.Template),
                  ~ ifelse(. == "True", 1, 0)))

head(data_repo_clean,n=3)
A data.frame: 3 × 12 (first rows of data_repo_clean: Size, Stars, Forks, Issues, plus the eight indicator columns now coded 0/1)

Table 4

Methods¶

Data Splitting¶

To evaluate our model's predictive performance and analyze the predictor coefficients identified through forward selection, we split the original dataset into two subsets: 60% for the training set and the remaining 40% for the test set.

In [8]:
# Main developer: Junhao
# Contributor: 

# Split the dataset
set.seed(123)

repo_split <- 
    data_repo_clean %>%
    initial_split(prop = 0.6, strata = Stars)

training_repo <- training(repo_split)
testing_repo <- testing(repo_split)

Baseline model: Intercept-only¶

We set the intercept-only model as the null model.

In [9]:
# Main developer: Xiaofei
# Contributor: Amy

# fit the null model
repo_null_OLS <- lm(Stars ~ 1, data = training_repo)
repo_null_OLS_results <- tidy(repo_null_OLS)

repo_null_OLS_results
A tibble: 1 × 5
term         estimate  std.error  statistic  p.value
(Intercept)  1110.304    11.2144   99.00701        0

Table 5

In [10]:
# Main developer: Xiaofei
# Contributor: Amy

# predict on the testing dataset
testing_repo <- testing_repo |>
    add_predictions(repo_null_OLS, var="pred_null_OLS")

# RMSE of the null model
rmse_null <- 
    testing_repo %>% 
    mutate(pred_error = Stars - pred_null_OLS) %>%
    summarise(RMSE = sqrt(mean(pred_error ^ 2))) %>%
    pull()

repo_RMSE_models <- tibble(
  Model = "OLS Null Regression",
  RMSE = rmse_null)

repo_RMSE_models
A tibble: 1 × 2
Model                RMSE
OLS Null Regression  3938.218

Table 6

The RMSE of the null model is 3938.218.
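For reference, the RMSE reported here and in later comparisons is computed on the test set as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

where $y_i$ is the observed number of Stars and $\hat{y}_i$ is the model's prediction. For the null model, $\hat{y}_i$ is simply the training-set mean of Stars for every observation.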

Stepwise Forward Selection¶

We apply the forward selection method to identify the predictors to include in our final model.

In [11]:
# Main developer: Xiaofei
# Contributor:

# forward selection
repo_forward_sel <- regsubsets(x = Stars ~ ., nvmax = length(colnames(data_repo_clean))-1,
                                  data = training_repo,
                                  method = "forward")
repo_forward_summary <- summary(repo_forward_sel)

repo_forward_summary
Subset selection object
Call: regsubsets.formula(x = Stars ~ ., nvmax = length(colnames(data_repo_clean)) - 
    1, data = training_repo, method = "forward")
11 Variables  (and intercept)
                Forced in Forced out
Size                FALSE      FALSE
Forks               FALSE      FALSE
Issues              FALSE      FALSE
Has.Issues          FALSE      FALSE
Has.Projects        FALSE      FALSE
Has.Downloads       FALSE      FALSE
Has.Wiki            FALSE      FALSE
Has.Pages           FALSE      FALSE
Has.Discussions     FALSE      FALSE
Is.Archived         FALSE      FALSE
Is.Template         FALSE      FALSE
1 subsets of each size up to 11
Selection Algorithm: forward
          Size Forks Issues Has.Issues Has.Projects Has.Downloads Has.Wiki
1  ( 1 )  " "  "*"   " "    " "        " "          " "           " "     
2  ( 1 )  " "  "*"   "*"    " "        " "          " "           " "     
3  ( 1 )  " "  "*"   "*"    " "        " "          " "           " "     
4  ( 1 )  " "  "*"   "*"    " "        " "          " "           "*"     
5  ( 1 )  " "  "*"   "*"    " "        " "          " "           "*"     
6  ( 1 )  " "  "*"   "*"    " "        "*"          " "           "*"     
7  ( 1 )  " "  "*"   "*"    " "        "*"          " "           "*"     
8  ( 1 )  " "  "*"   "*"    " "        "*"          " "           "*"     
9  ( 1 )  " "  "*"   "*"    " "        "*"          "*"           "*"     
10  ( 1 ) " "  "*"   "*"    "*"        "*"          "*"           "*"     
11  ( 1 ) "*"  "*"   "*"    "*"        "*"          "*"           "*"     
          Has.Pages Has.Discussions Is.Archived Is.Template
1  ( 1 )  " "       " "             " "         " "        
2  ( 1 )  " "       " "             " "         " "        
3  ( 1 )  " "       "*"             " "         " "        
4  ( 1 )  " "       "*"             " "         " "        
5  ( 1 )  "*"       "*"             " "         " "        
6  ( 1 )  "*"       "*"             " "         " "        
7  ( 1 )  "*"       "*"             "*"         " "        
8  ( 1 )  "*"       "*"             "*"         "*"        
9  ( 1 )  "*"       "*"             "*"         "*"        
10  ( 1 ) "*"       "*"             "*"         "*"        
11  ( 1 ) "*"       "*"             "*"         "*"        

Table 7

In [12]:
# Main developer: Xiaofei
# Contributor:

repo_forward_summary_df <- tibble(
    n_input_variables = 1:11,
    RSQ = repo_forward_summary$rsq,
    RSS = repo_forward_summary$rss,
    ADJ_R2 = repo_forward_summary$adjr2,
    Cp = repo_forward_summary$cp,
    BIC = repo_forward_summary$bic,
)
repo_forward_summary_df
A tibble: 11 × 6
n_input_variables       RSQ           RSS    ADJ_R2           Cp        BIC
                1 0.3333308 1.395552e+12 0.3333257 5468.749781 -52287.46
                2 0.3534470 1.353442e+12 0.3534370 1412.898041 -56228.60
                3 0.3583562 1.343166e+12 0.3583413  424.619700 -57200.16
                4 0.3595923 1.340578e+12 0.3595724  177.279653 -57437.17
                5 0.3600232 1.339676e+12 0.3599984   92.360638 -57512.24
                6 0.3603109 1.339074e+12 0.3602812   36.316745 -57558.49
                7 0.3603823 1.338925e+12 0.3603476   23.922696 -57561.12
                8 0.3604465 1.338790e+12 0.3604069   12.958281 -57562.31
                9 0.3604797 1.338721e+12 0.3604351    8.271656 -57557.23
               10 0.3604807 1.338719e+12 0.3604311   10.072526 -57545.66
               11 0.3604810 1.338718e+12 0.3604265   12.000000 -57533.97

Table 8

In Table 8,

  • Based on BIC, the forward algorithm identifies the 8-variable model as optimal.
  • Based on Cp, the forward algorithm identifies the 9-variable model as optimal.

As a result, we use Cp as our selection criterion and choose the 9-variable model with the lowest Cp value.
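As a reminder, Mallows's $C_p$ for a candidate model with $p$ coefficients (including the intercept) is

$$C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2p,$$

where $\hat{\sigma}^2$ is the residual variance estimate from the full model. A model with little bias satisfies $C_p \approx p$, which is why the full 11-variable model in Table 8 has $C_p = 12$ exactly.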

In [13]:
# Main developer: Xiaofei
# Contributor: Amy

# select 9 variables with lowest Cp
cp_min = which.min(repo_forward_summary$cp) 

selected_var <- names(coef(repo_forward_sel, cp_min))[-1]
selected_var
  1. 'Forks'
  2. 'Issues'
  3. 'Has.Projects'
  4. 'Has.Downloads'
  5. 'Has.Wiki'
  6. 'Has.Pages'
  7. 'Has.Discussions'
  8. 'Is.Archived'
  9. 'Is.Template'

These 9 variables are selected as the predictors for fitting the prediction model.

In [14]:
# Main developer: Xiaofei
# Contributor:

# subset only the predictors selected from the full dataset
training_subset <- 
    training_repo %>% 
    select(all_of(selected_var), Stars)

testing_subset <- 
    testing_repo %>% 
    select(all_of(selected_var), Stars)

head(training_subset, n=3)
A data.frame: 3 × 10 (first rows of training_subset: the nine selected predictors plus Stars)

Table 9

Fit the model¶

In [15]:
# Main developer: Xiaofei
# Contributor:Junhao

# Convert the binary variables (currently numeric) to `factor` type.
training_subset <- training_subset %>%
  mutate(across(c(Has.Projects, Has.Downloads, Has.Wiki, Has.Pages, 
                  Has.Discussions, Is.Archived, Is.Template), as.factor))

testing_subset <- testing_subset %>%
  mutate(across(c(Has.Projects, Has.Downloads, Has.Wiki, Has.Pages, 
                  Has.Discussions, Is.Archived, Is.Template), as.factor))

# Fitting the model using the selected predictors
# Use test set to check the coefficients of the selected predictors
model = lm(Stars ~ ., data = testing_subset)
tidy(summary(model))%>% mutate_if(is.numeric, round, 4)
A tibble: 10 × 5
term              estimate  std.error  statistic  p.value
(Intercept)      1097.4547   115.6185     9.4920   0.0000
Forks               1.9253     0.0097   199.2552   0.0000
Issues              1.7305     0.0619    27.9533   0.0000
Has.Projects1    -278.2259    39.1953    -7.0984   0.0000
Has.Downloads1   -190.8266   113.9027    -1.6753   0.0939
Has.Wiki1        -230.7400    33.2725    -6.9348   0.0000
Has.Pages1        202.5394    28.0746     7.2143   0.0000
Has.Discussions1  783.8735    33.0922    23.6875   0.0000
Is.Archived1     -178.1925    43.0245    -4.1417   0.0000
Is.Template1     -527.0365   131.6852    -4.0022   0.0001

Table 10

Interpretation of the predictors¶

In Table 10:

Significant Positive Predictors:

  • Forks: The number of forks is the strongest predictor, indicating that frequently forked repositories tend to be more popular.
  • Issues: A higher number of open issues correlates with more Stars, possibly reflecting an active and engaged user base.
  • Has.Pages and Has.Discussions: Repositories with GitHub Pages and Discussions enabled tend to have more Stars.

Significant Negative Predictors:

  • Has.Projects and Has.Wiki: These features negatively influence the number of Stars, suggesting they may not be relevant to popularity or may be associated with less prominent projects.
  • Is.Archived and Is.Template: Archived repositories and templates are less likely to receive Stars, likely because they are not actively maintained.

Non-Significant, potentially Negative Predictor:

  • Has.Downloads: This feature's coefficient is not statistically significant (p ≈ 0.09 at the 0.05 significance level). The estimate is negative, but we cannot be confident in the sign or magnitude of this effect.

Predict on the testing dataset - Compare RMSE¶

In [16]:
# Main developer: Amy
# Contributor:

# RMSE of the reduced model
rmse_red <- 
    testing_subset %>%
    add_predictions(model, var = 'pred_reduced_reg') %>%
    mutate(pred_red_error = Stars - pred_reduced_reg) %>%
    summarize(RMSE = sqrt(mean(pred_red_error^2))) %>%
    pull()

# RMSE of the full model
repo_full_OLS <- lm(Stars ~ ., data = training_repo)
testing_repo <- testing_repo |>
    add_predictions(repo_full_OLS, var="pred_full_OLS")
rmse_full <- 
    testing_repo %>% 
    mutate(pred_error = Stars - pred_full_OLS) %>%
    summarise(RMSE = sqrt(mean(pred_error ^ 2))) %>%
    pull()

# compare RMSE of the three models
repo_RMSE_models_expanded <- 
    bind_rows(
        repo_RMSE_models,
        tibble(Model = "OLS Reduced Regression",
               RMSE = rmse_red),
        tibble(Model = "OLS Full Regression",
               RMSE = rmse_full)
    )

repo_RMSE_models_expanded
A tibble: 3 × 2
Model                   RMSE
OLS Null Regression     3938.218
OLS Reduced Regression  3131.400
OLS Full Regression     3146.908

Table 11

From the above comparison, we see that the OLS Reduced model has the lowest RMSE (3131.400), so we use it as our final model for prediction. Although the RMSE of the full regression model (3146.908) is close to that of the reduced model, we decided against using the full model to avoid unnecessary computational cost and to reduce model complexity. A simpler model with fewer predictors is not only more efficient to compute but also tends to be more interpretable and easier to maintain. Thus, the reduced model strikes a good balance between predictive accuracy and practical considerations.

Assumptions Checking¶

Since the original dataset contains over 215k observations, we draw a random sample of 5,000 to reduce computational cost, time, and memory usage, and use this sample to check the model assumptions.

In [17]:
# Main developer: Amy
# Contributor:Junhao

# draw a random sample of size 5000
set.seed(123)
training_sample <- training_subset %>% slice_sample(n = 5000)

# fit a model based on the sample
model_sample <- lm(Stars ~ ., data = training_sample)

# plots to check Linearity, Homoscedasticity and Normality of residuals
plot(model_sample, which = 1)  # Residuals vs Fitted
plot(model_sample, which = 2)  # Q-Q plot
plot(model_sample, which = 3) # Scale-Location Plot

Figures 3, 4, 5

Residuals vs Fitted Plot: Most points cluster near the left end of the red line, with a few large outliers. This suggests small residuals for most observations, but potentially large errors for some data points.

Q-Q Plot: A slight downward-left tail and a pronounced upward-right tail indicate deviations from normality with heavy tails.

Scale-Location Plot: Almost all points cluster near the lower-left corner, with a few outliers scattered outside this region. This suggests that the residual variance is not constant across the range of fitted values, indicating potential heteroscedasticity.

In [18]:
# Main developer: Amy
# Contributor:

# Check multicollinearity using VIF
vif_values <- vif(model)

print(vif_values)
          Forks          Issues    Has.Projects   Has.Downloads        Has.Wiki 
       1.096599        1.103553        1.569138        1.014743        1.595606 
      Has.Pages Has.Discussions     Is.Archived     Is.Template 
       1.012963        1.043344        1.013778        1.001229 

The VIF values for all predictors are below 2, indicating no significant multicollinearity in our model.
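For each predictor $j$, the variance inflation factor is

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the $R^2$ obtained by regressing predictor $j$ on all the other predictors. Values near 1, as observed here, indicate that the predictors are nearly uncorrelated with one another.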

Discussion¶

Based on the results, we conclude that each selected variable significantly affects Stars while holding the other eight constant. The number of stars increases as forks increase, consistent with Borges et al. (2016), who report a strong positive relationship between stars and forks: the number of times users have copied or adapted a repository signals that it is more popular. Additionally, a higher number of open issues is associated with more stars. This positive correlation may arise because repositories receiving more attention attract more users who experiment with different operations and report more bugs; more usage means more visits, which explains the increase in stars. Repositories with discussion boards also tend to have more stars, because these boards provide support and solutions to users, enabling them to share knowledge, troubleshoot issues, and collaborate effectively. This positive user experience potentially leads users to show more interest in the repository.

The presence of downloadable files (Has.Downloads), the use of GitHub Projects for task organization (Has.Projects), and the presence of a Wiki (Has.Wiki) show negative relationships with stars. These three variables, which we expected to contribute positively, have coefficients in the opposite direction. This may be because most variable values cluster near the center of the distribution, with some extreme outliers affecting the coefficient direction. Alternatively, repositories with downloadable files, Projects boards, and Wikis may simply be less popular because these features are not typically associated with trending repositories. Moreover, Is.Archived and Is.Template also correlate negatively with stars; template-based or read-only repositories may be less appealing to users because they provide limited actionable information or interactive functionality.

Adjusted $R^2$ and Cp values were compared during forward selection to determine the best model. Unlike $R^2$, adjusted $R^2$ penalizes the inclusion of variables that do not meaningfully improve the fit, so a high adjusted $R^2$ indicates that each included predictor carries real explanatory weight. Mallows's Cp, on the other hand, favors models with a lower residual sum of squares (RSS) and fewer predictors, with lower Cp values indicating a better trade-off. Additionally, judging significance by p-value, Has.Downloads has a p-value of 0.09, greater than the 0.05 significance level; although the estimated decrease is plausible, we cannot be confident that the presence of downloadable files affects Stars. Hence, the best 9 variables were chosen to build the linear regression model.
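Concretely, the penalty behind adjusted $R^2$ takes the form

$$\bar{R}^2 = 1 - \frac{\mathrm{RSS}/(n - p - 1)}{\mathrm{TSS}/(n - 1)},$$

where $p$ is the number of predictors: adding a variable raises $\bar{R}^2$ only if it reduces RSS enough to offset the lost degree of freedom.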

Ultimately, the model was finalized after using VIF checks to confirm the absence of multicollinearity among the predictors. To validate the assumptions, the original dataset was sampled to create a representative random subset, facilitating assumption checking through visualization. From the diagnostic plots, we observed issues such as non-linearity, heteroscedasticity, and deviations from normality. Non-linearity suggests that the relationship between stars and the explanatory variables may not be linear. These violations are potentially caused by numerous outliers or by important variables missing from the dataset. Transformations such as logarithmic, square root, or quadratic terms may be necessary to address heteroscedasticity; for normality violations, alternative modeling methods beyond linear regression could be explored.
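As one possible remedy, the reduced model could be refit on a log-transformed response (a minimal sketch, assuming the training_subset object created earlier; Stars ≥ 167 in this dataset, so the logarithm is well defined):

```r
# Refit the reduced model on a log-transformed response to
# stabilize the residual variance.
# "- Stars" excludes the untransformed response, which "." would
# otherwise include as a predictor when the LHS is log(Stars).
model_log <- lm(log(Stars) ~ . - Stars, data = training_subset)
plot(model_log, which = 1)  # re-inspect Residuals vs Fitted
```

Whether this resolves the heteroscedasticity would need to be confirmed from the new diagnostic plots.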

In summary, this study analyzed factors influencing repository popularity and identified general trends. Higher numbers of issues and forks are associated with greater popularity, while enabling GitHub Projects, downloads, or a Wiki is associated with lower popularity. Read-only and template-based repositories also show lower appeal. Due to the assumption violations, future research could focus on transforming or re-weighting variables to construct a more accurate model and achieve improved analyses.

References¶

Borges, H., Hora, A., & Valente, M. T. (2016). Understanding the Factors That Impact the Popularity of GitHub Repositories. 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME). https://doi.org/10.1109/icsme.2016.31

Borges, H., Hora, A., & Valente, M. T. (2018). What’s in a GitHub Star? Understanding Repository Starring Practices in a Social Coding Platform. Journal of Systems and Software, 146, 112–129. https://doi.org/10.1016/j.jss.2018.09.016

Canard. (2023, December). Most Popular Github Repositories (Projects). Retrieved November 12, 2024, from https://www.kaggle.com/datasets/donbarbos/github-repos/data